feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42) by Dumbris · Pull Request #747 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-22T12:06:55Z

What

First, fully-deterministic slice of the roadmap-#19 benchmark harness (MCP-42): the token-reduction numbers behind mcpproxy's "massive token savings" claim. In-repo under bench/ (per board decision — no separate public repo).

Compares the static context-token cost of the three routing modes over a frozen tool corpus:

| Mode | Tools in context | Tokens | Savings |
|---|---|---|---|
| `baseline` (all tools loaded) | 45 | 1730 | — |
| `retrieve_tools` (BM25 discovery) | 5 | 596 | 65.5% |
| `code_execution` (orchestration) | 2 | 513 | 70.3% |

These are a conservative floor: input schemas are excluded uniformly (the committed corpus has none), which understates the baseline; and savings scale with tool count (real deployments expose hundreds–thousands of tools).
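For reference, the Savings column follows from the Tokens column as `1 - mode_tokens / baseline_tokens`. The sketch below reproduces the two percentages from the table; the function name `savings` is illustrative and not taken from the harness. (These figures are revised later in this thread once management tools are added to the proxy-mode catalog.)

```go
package main

import "fmt"

// savings returns the fraction of context tokens a routing mode saves
// relative to the baseline: 1 - modeTokens/baselineTokens.
// The name and signature are illustrative, not the harness API.
func savings(baselineTokens, modeTokens int) float64 {
	if baselineTokens <= 0 {
		return 0
	}
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

func main() {
	const baseline = 1730 // tokens with all 45 corpus tools loaded (from the table)
	modes := []struct {
		name   string
		tokens int
	}{
		{"retrieve_tools", 596},
		{"code_execution", 513},
	}
	for _, m := range modes {
		fmt.Printf("%s: %.1f%%\n", m.name, 100*savings(baseline, m.tokens))
	}
	// Prints:
	// retrieve_tools: 65.5%
	// code_execution: 70.3%
}
```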

How

Reuses the Spec 065 frozen corpus (specs/065-evaluation-foundation/datasets/corpus_v1.tools.json) as a versioned, non-drifting universe (CN-002).
Tokenizer: tiktoken cl100k_base — already a repo dependency, reproducible, model-agnostic estimator. No new deps.
Real proxy tool definitions captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded in-file).
go run ./bench/cmd/bench → report.json + self-contained dashboard.html in bench/results/ (gitignored; reports never committed per Spec 065 CN-003).
Methodology, scoring rubric, dataset sources, known limitations, reviewer contact: bench/README.md.
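A minimal sketch of the static context-cost measurement described above, assuming the cost of a mode is the sum of token counts over each exposed tool's name and description. The `Tool` type, `TokenCounter`, and `contextCost` are hypothetical names, and the whitespace counter is only a stand-in so the example runs without dependencies; the harness itself uses tiktoken cl100k_base, and how it serializes a tool before counting is not shown in this PR text.

```go
package main

import (
	"fmt"
	"strings"
)

// Tool is a hypothetical minimal tool definition: name and description only,
// matching the schema-free corpus described above.
type Tool struct {
	Name        string
	Description string
}

// TokenCounter abstracts the tokenizer so a real encoder can be plugged in.
type TokenCounter func(text string) int

// wordCount is a stand-in counter (whitespace-separated words).
// It is NOT cl100k_base and produces different numbers.
func wordCount(text string) int { return len(strings.Fields(text)) }

// contextCost sums the token count of every tool a mode places in context.
func contextCost(tools []Tool, count TokenCounter) int {
	total := 0
	for _, t := range tools {
		total += count(t.Name) + count(t.Description)
	}
	return total
}

func main() {
	// Made-up tools for illustration; not entries from corpus_v1.tools.json.
	baseline := []Tool{
		{Name: "read_file", Description: "Read a file from disk"},
		{Name: "write_file", Description: "Write a file to disk"},
		{Name: "list_dir", Description: "List directory entries"},
	}
	proxy := []Tool{
		{Name: "retrieve_tools", Description: "Search available tools by query"},
	}
	fmt.Println("baseline:", contextCost(baseline, wordCount))
	fmt.Println("proxy:", contextCost(proxy, wordCount))
}
```

Passing the counter as a function keeps the measurement deterministic for a fixed tokenizer and makes it easy to swap encoders when comparing estimators.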

Tests

TDD: go test ./bench/ — deterministic tokenizer, per-mode tool exposure, real savings in (0,1), baseline monotonicity. Race-clean.
gofmt, go vet, and golangci-lint v2 (strict CI config) all clean.

Scoped but NOT in this PR (tracked as follow-ups)

These need decisions / other lanes, so they're deliberately deferred (see bench/README.md):

Live run (docker-compose skeleton included): full schemas from GET /api/v1/tools for the exact headline number + Recall@k accuracy (reusing the Spec 065 retrieval golden set) + latency.
End-to-end task success with a pinned LLM — needs a pinned model + LLM-call budget.
CI publish-on-release-tag → public dashboard — Release/DevOps lane.
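As a rough illustration of the Recall@k accuracy metric planned for the live run: given a ranked list of retrieved tool IDs and the set of relevant IDs for a query, Recall@k is the share of relevant IDs found in the top k. The sketch assumes binary relevance and unique IDs in the ranking; the function names and tool IDs are hypothetical, and the format of the Spec 065 golden set is not shown here.

```go
package main

import "fmt"

// recallAtK returns |relevant ∩ top-k(ranked)| / |relevant|.
// Assumes binary relevance and no duplicate IDs in ranked.
func recallAtK(ranked []string, relevant map[string]struct{}, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if _, ok := relevant[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// reciprocalRank returns 1/rank of the first relevant result, or 0 if none.
// Averaging this value over all queries gives MRR.
func reciprocalRank(ranked []string, relevant map[string]struct{}) float64 {
	for i, id := range ranked {
		if _, ok := relevant[id]; ok {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	// Hypothetical tool IDs for a single query.
	ranked := []string{"fs_list", "fs_read", "git_log", "fs_write"}
	relevant := map[string]struct{}{"fs_read": {}, "fs_write": {}}
	for _, k := range []int{1, 3, 5, 10} {
		fmt.Printf("Recall@%d = %.2f\n", k, recallAtK(ranked, relevant, k))
	}
	fmt.Printf("RR = %.2f\n", reciprocalRank(ranked, relevant))
}
```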

Related #MCP-42

…P-42)

Ship the first, fully-deterministic slice of the roadmap-#19 benchmark: the token-reduction numbers behind the "massive token savings" claim. Reuses the frozen Spec 065 tool corpus (45 tools, 7 reference servers) as a versioned, non-drifting universe and tiktoken cl100k_base (already a dep) as a reproducible model-agnostic estimator.

Compares the three routing modes' static context cost:

- baseline (all upstream tools loaded directly)
- retrieve_tools (BM25 discovery + call_tool variants)
- code_execution (orchestration + retrieve_tools)

over the corpus and reports per-mode savings.

Real proxy tool defs are captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded). Emits report.json + a self-contained dashboard.html (gitignored; reports never committed, per Spec 065 CN-003).

Conservative by construction: input schemas excluded uniformly understates the baseline, so measured savings (65.5% / 70.3% on the 45-tool corpus) are a floor.

Methodology, limitations, and the scoped-but-not-yet-built follow-ups (live run with full schemas + accuracy/latency, LLM e2e, CI publish) are in bench/README.md.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-06-22T12:08:03Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`9a92d71`
Status:	✅ Deploy successful!
Preview URL:	https://5553fbac.mcpproxy-docs.pages.dev
Branch Preview URL:	https://feat-mcp-42-bench-harness.mcpproxy-docs.pages.dev


codecov-commenter · 2026-06-22T12:12:05Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 63.50365% with 50 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bench/cmd/bench/main.go | 0.00% | 22 Missing ⚠️ |
| bench/report.go | 44.82% | 8 Missing and 8 partials ⚠️ |
| bench/tokens.go | 79.59% | 5 Missing and 5 partials ⚠️ |
| bench/proxytools.go | 88.88% | 2 Missing ⚠️ |


github-actions · 2026-06-22T12:15:16Z

📦 Build Artifacts

Workflow Run: View Run
Branch: feat/mcp-42-bench-harness

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (28 MB)
archive-windows-arm64 (25 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 27971074417 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

… smoke test

KimiReviewer finding 2: code_execution is at line 626 in mcp.go at 89f06b5, not 675 as claimed. Line numbers drift with unrelated edits and the actual function names are the stable identifier, so remove all line numbers from the provenance comment to prevent future rot.

KimiReviewer finding 3: add TestWriteReports_SmokeTest covering WriteReports output (JSON round-trips to Report, HTML is non-empty and contains all mode names).

All 5 tests pass; golangci-lint v2 clean.

Related #MCP-42

Co-Authored-By: Paperclip <noreply@paperclip.ing>

mcpproxy-gatekeeper

✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

Dumbris · 2026-06-22T13:08:00Z

CodexReviewer: changes requested (benchmark integrity). The fixture (bench/proxy_tools_v1.json) + README model only 6 proxy tools and assume minimal per-mode tool sets, but the real routing modes append the shared management tools (mcp_routing.go:337/437, mcp.go:656/780) — so proxy context cost is undercounted and the 65.5%/70.3% savings are overstated. Derive the per-mode tool catalog from the server builders (not a hand-maintained JSON) and re-run so the headline numbers are real. Details on MCP-42.

…cl. management tools (MCP-3161)

The token-reduction benchmark scored only 6 hand-maintained proxy tools and omitted the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries) that both routing modes append via buildManagementTools. That undercounted the proxy-mode context cost and inflated the headline savings (Codex finding on PR #747).

Replace bench/proxy_tools_v1.json with server.ProxyModeToolDefs, which builds the catalog from the live builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go) so it can never drift from production and always reflects the tools the agent actually sees. This also fixes a second drift: the fixture's retrieve_tools descriptions did not match the per-mode builder descriptions.

Corrected figures over the 45-tool Spec 065 corpus (name+description only): retrieve_tools ~17% (10 tools), code_execution ~43% (6 tools). Updated README and notes; the schema-exclusion claim is no longer unambiguously conservative now that large-schema management tools are in the proxy cost.

Tests: bench asserts both modes include the 4 management tools; internal/server pins ProxyModeToolDefs to the builders so the catalog can't silently drift.

Related #747
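The management-tool assertion mentioned in the commit message could look roughly like the check below. This is a sketch only: the four tool names come from the commit message, while `missingManagementTools` and the slice-of-names catalog shape are hypothetical and not the actual `server.ProxyModeToolDefs` API.

```go
package main

import "fmt"

// managementTools lists the shared management tools named in the commit message.
var managementTools = []string{
	"upstream_servers",
	"quarantine_security",
	"search_servers",
	"list_registries",
}

// missingManagementTools returns the management tools absent from a mode's
// tool catalog. An empty result means the catalog is complete in that respect.
func missingManagementTools(catalog []string) []string {
	present := make(map[string]bool, len(catalog))
	for _, name := range catalog {
		present[name] = true
	}
	var missing []string
	for _, name := range managementTools {
		if !present[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// A catalog shaped like the old hand-maintained fixture: no management tools.
	fixture := []string{"retrieve_tools", "code_execution"}
	fmt.Println("missing:", missingManagementTools(fixture))
}
```

A check of this kind fails loudly when a catalog omits the shared tools, which is the undercounting that inflated the original savings figures.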

mcpproxy-gatekeeper

✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).

This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.

Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).

mcpproxy-gatekeeper

✅ Gatekeeper approval — MCP-42 benchmark harness on corrected head 9a92d71. Full mandated gate satisfied: CodexReviewer (first review) caught inflated savings (fixture omitted management tools); BackendEngineer fixed it (derive per-mode catalog from live server builders); KimiReviewer ACCEPT (model-diverse) + QATester PASS (MCP-3162) on this head + operator-verified. Honest numbers now: retrieve_tools ~17%, code_execution ~43% (were 65.5/70.3). CI green. Author≠approver.

…MCP-42a) (#748)

* feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a)

Extends the bench/ harness (PR #747) with a live run against a running proxy:

- Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON input schemas; proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas counted on BOTH sides so the headline savings is authoritative, and withheld (authoritative_headline=false) if any proxy tool lacks a schema, the MCP-3161 overstatement guard.
- Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP against graded labels (deterministic, no LLM). Field names mirror Spec 065 score-report.schema.json.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost (server "took" is a 0ms stub).

CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay gitignored (CN-003). All metric math + the live client are unit-tested with httptest stubs; the docker-compose substrate is the live-reproduction path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(bench): preserve upstream schemas through /api/v1/tools baseline

ConvertGenericToolsToTyped read generic["schema"], but every producer of the generic tool map (runtime/server GetServerTools, mcp.go) emits the upstream input schema under "inputSchema". The /api/v1/tools response therefore dropped every schema, so the MCP-42a live benchmark baseline was silently a description-only token count instead of the required full-schema count, while still able to emit authoritative_headline=true.

- Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
- Gate the live headline on baseline schemas too (BaselineSchemasCounted via anyHaveSchema): a systematically schema-less baseline now withholds the headline instead of claiming a full-schema baseline it never had.
- Tests: converter preserves inputSchema (+legacy schema fallback); headline withheld when the baseline carries no schemas.

Related #748

* fix(bench): conform live retrieval report to Spec 065 score-report schema

Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval` payload emitted flat metric fields, but score-report.schema.json requires nested `retrieval.metrics` + `retrieval.gate`. Restructure RetrievalMetrics into {metrics, gate} so live_report.json validates against the contract, proven by a new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema). A standalone live run has no stored baseline, so gate.passed is true by construction (CI regression-gating against a committed baseline is MCP-3133).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
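The schema fix and the headline gate described in these commits can be sketched as follows, assuming tools arrive as generic `map[string]any` values. `anyHaveSchema` is named in the commit message, but its body here is a guess; `inputSchemaOf`, `allHaveSchema`, and `authoritativeHeadline` are hypothetical helpers, not the actual converter or report code.

```go
package main

import "fmt"

// inputSchemaOf reads a tool's input schema from a generic tool map,
// preferring "inputSchema" and falling back to the legacy "schema" key.
func inputSchemaOf(tool map[string]any) (map[string]any, bool) {
	for _, key := range []string{"inputSchema", "schema"} {
		if s, ok := tool[key].(map[string]any); ok && len(s) > 0 {
			return s, true
		}
	}
	return nil, false
}

// anyHaveSchema reports whether at least one tool carries a schema.
func anyHaveSchema(tools []map[string]any) bool {
	for _, t := range tools {
		if _, ok := inputSchemaOf(t); ok {
			return true
		}
	}
	return false
}

// allHaveSchema reports whether the list is non-empty and every tool carries a schema.
func allHaveSchema(tools []map[string]any) bool {
	for _, t := range tools {
		if _, ok := inputSchemaOf(t); !ok {
			return false
		}
	}
	return len(tools) > 0
}

// authoritativeHeadline withholds the headline unless schemas were counted
// on both sides: some baseline schemas present and no proxy tool missing one.
func authoritativeHeadline(baseline, proxy []map[string]any) bool {
	return anyHaveSchema(baseline) && allHaveSchema(proxy)
}

func main() {
	object := map[string]any{"type": "object"}
	baseline := []map[string]any{{"name": "upstream_a", "inputSchema": object}}
	proxy := []map[string]any{{"name": "retrieve_tools", "schema": object}}
	fmt.Println(authoritativeHeadline(baseline, proxy)) // true
	// A schema-less baseline withholds the headline.
	fmt.Println(authoritativeHeadline([]map[string]any{{"name": "upstream_a"}}, proxy)) // false
}
```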

mcpproxy-gatekeeper Bot approved these changes Jun 22, 2026


Dumbris merged commit 4a24175 into main Jun 22, 2026
38 checks passed

Dumbris mentioned this pull request Jun 22, 2026

feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a) #748

Merged

Dumbris merged 3 commits into main from feat/mcp-42-bench-harness

2 participants
